Biostat 203B Homework 3

Due Feb 23 @ 11:59PM

Author

Xier Lu UID:206331941

Display machine information for reproducibility:

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.2    fastmap_1.1.1     cli_3.6.2        
 [5] tools_4.3.2       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.8       
 [9] rmarkdown_2.25    knitr_1.45        jsonlite_1.8.8    xfun_0.41        
[13] digest_0.6.34     rlang_1.1.3       evaluate_0.23    

Load necessary libraries (you can add more as needed).

library(arrow)

Attaching package: 'arrow'
The following object is masked from 'package:utils':

    timestamp
library(memuse)
library(pryr)
library(R.utils)
Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.25.0 (2022-06-12 02:20:02 UTC) successfully loaded. See ?R.oo for help.

Attaching package: 'R.oo'
The following object is masked from 'package:R.methodsS3':

    throw
The following objects are masked from 'package:methods':

    getClasses, getMethods
The following objects are masked from 'package:base':

    attach, detach, load, save
R.utils v2.12.3 (2023-11-18 01:00:02 UTC) successfully loaded. See ?R.utils for help.

Attaching package: 'R.utils'
The following object is masked from 'package:arrow':

    timestamp
The following object is masked from 'package:utils':

    timestamp
The following objects are masked from 'package:base':

    cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::compose()      masks pryr::compose()
✖ lubridate::duration() masks arrow::duration()
✖ tidyr::extract()      masks R.utils::extract()
✖ dplyr::filter()       masks stats::filter()
✖ dplyr::lag()          masks stats::lag()
✖ purrr::partial()      masks pryr::partial()
✖ dplyr::where()        masks pryr::where()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Display your machine memory.

#  eval: ture
memuse::Sys.meminfo()
Totalram:   18.000 GiB 
Freeram:   149.672 MiB 

In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.

Q1. Visualizing patient trajectory

Visualizing a patient’s encounters in a health care system is a common task in clinical data analysis. In this question, we will visualize a patient’s ADT (admission-discharge-transfer) history and ICU vitals in the MIMIC-IV data.

Q1.1 ADT history

A patient’s ADT history records the time of admission, discharge, and transfer in the hospital. This figure shows the ADT history of the patient with subject_id 10001217 in the MIMIC-IV data. The x-axis is the calendar time, and the y-axis is the type of event (ADT, lab, procedure). The color of the line segment represents the care unit. The size of the line segment represents whether the care unit is an ICU/CCU. The crosses represent lab events, and the shape of the dots represents the type of procedure. The title of the figure shows the patient’s demographic information and the subtitle shows top 3 diagnoses.

Do a similar visualization for the patient with subject_id 10013310 using ggplot.

Hint: We need to pull information from data files patients.csv.gz, admissions.csv.gz, transfers.csv.gz, labevents.csv.gz, procedures_icd.csv.gz, diagnoses_icd.csv.gz, d_icd_procedures.csv.gz, and d_icd_diagnoses.csv.gz. For the big file labevents.csv.gz, use the Parquet format you generated in Homework 2. For reproducibility, make the Parquet folder labevents_pq available at the current working directory hw3, for example, by a symbolic link. Make your code reproducible.

####Answer

# eval: true
library(ggplot2)
library(dplyr)
library(arrow)

sid <- 10013310

# read the data
sid_patients <- read_csv("~/mimic/hosp/patients.csv.gz") |> 
  filter(subject_id == sid) |>
  print(width = Inf)
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 1 × 6
  subject_id gender anchor_age anchor_year anchor_year_group dod       
       <dbl> <chr>       <dbl>       <dbl> <chr>             <date>    
1   10013310 F              70        2153 2017 - 2019       2153-11-19
sid_admissions <- read_csv("~/mimic/hosp/admissions.csv.gz") |>
  filter(subject_id == sid) |>
  print(width = Inf)
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 3 × 16
  subject_id  hadm_id admittime           dischtime           deathtime
       <dbl>    <dbl> <dttm>              <dttm>              <dttm>   
1   10013310 21243435 2153-05-26 14:18:00 2153-06-05 19:30:00 NA       
2   10013310 22098926 2153-06-10 11:55:00 2153-07-21 18:00:00 NA       
3   10013310 27682188 2153-05-06 18:03:00 2153-05-13 13:45:00 NA       
  admission_type    admit_provider_id admission_location       
  <chr>             <chr>             <chr>                    
1 OBSERVATION ADMIT P78TNY            INFORMATION NOT AVAILABLE
2 OBSERVATION ADMIT P09IS0            INFORMATION NOT AVAILABLE
3 URGENT            P89ZCW            TRANSFER FROM HOSPITAL   
  discharge_location       insurance language marital_status race         
  <chr>                    <chr>     <chr>    <chr>          <chr>        
1 HOME HEALTH CARE         Medicare  ?        SINGLE         BLACK/AFRICAN
2 SKILLED NURSING FACILITY Medicare  ?        SINGLE         BLACK/AFRICAN
3 HOME HEALTH CARE         Medicare  ?        SINGLE         BLACK/AFRICAN
  edregtime           edouttime           hospital_expire_flag
  <dttm>              <dttm>                             <dbl>
1 2153-05-26 08:56:00 2153-05-26 16:33:00                    0
2 2153-06-10 10:40:00 2153-06-10 11:25:00                    0
3 2153-05-06 10:21:00 2153-05-06 18:28:00                    0
sid_adt <- read_csv("~/mimic/hosp/transfers.csv.gz") |>
  filter(subject_id == sid) |>
  print(width = Inf)
Rows: 1890972 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): eventtype, careunit
dbl  (3): subject_id, hadm_id, transfer_id
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 14 × 7
   subject_id  hadm_id transfer_id eventtype
        <dbl>    <dbl>       <dbl> <chr>    
 1   10013310 21243435    31696219 discharge
 2   10013310 21243435    31736720 ED       
 3   10013310 21243435    33511674 transfer 
 4   10013310 21243435    34848129 transfer 
 5   10013310 21243435    38910974 admit    
 6   10013310 22098926    31651850 transfer 
 7   10013310 22098926    32769810 admit    
 8   10013310 22098926    33278851 transfer 
 9   10013310 22098926    34063502 ED       
10   10013310 22098926    36029206 discharge
11   10013310 27682188    30077870 transfer 
12   10013310 27682188    30444898 discharge
13   10013310 27682188    31203589 admit    
14   10013310 27682188    35160955 ED       
   careunit                                        intime             
   <chr>                                           <dttm>             
 1 <NA>                                            2153-06-05 19:58:00
 2 Emergency Department                            2153-05-26 08:56:00
 3 Medicine/Cardiology                             2153-05-26 16:19:26
 4 Medicine/Cardiology                             2153-05-26 14:42:55
 5 Medicine/Cardiology                             2153-05-26 14:18:39
 6 Neuro Intermediate                              2153-06-12 16:31:33
 7 Neuro Surgical Intensive Care Unit (Neuro SICU) 2153-06-10 11:55:42
 8 Medicine                                        2153-06-16 19:03:14
 9 Emergency Department                            2153-06-10 10:40:00
10 <NA>                                            2153-07-21 18:02:28
11 Medicine/Cardiology                             2153-05-07 20:47:19
12 <NA>                                            2153-05-13 15:36:52
13 Coronary Care Unit (CCU)                        2153-05-06 18:28:00
14 Emergency Department                            2153-05-06 10:21:00
   outtime            
   <dttm>             
 1 NA                 
 2 2153-05-26 14:18:39
 3 2153-06-05 19:58:00
 4 2153-05-26 16:19:26
 5 2153-05-26 14:42:55
 6 2153-06-16 19:03:14
 7 2153-06-12 16:31:33
 8 2153-07-21 18:02:28
 9 2153-06-10 11:55:42
10 NA                 
11 2153-05-13 15:36:52
12 NA                 
13 2153-05-07 20:47:19
14 2153-05-06 18:28:00
sid_lab <- arrow::read_parquet("/Users/luxier/Desktop/203hw/hw3/labevents_pq/part-0.parquet") |> 
  filter(subject_id == sid) |>
  print(width = Inf)
# A tibble: 2,285 × 16
   labevent_id subject_id hadm_id specimen_id itemid order_provider_id
         <int>      <int>   <int>       <int>  <int> <chr>            
 1      153564   10013310      NA     4841989  50887 ""               
 2      153565   10013310      NA     8958046  50934 ""               
 3      153566   10013310      NA     8958046  50947 ""               
 4      153567   10013310      NA     8958046  51003 ""               
 5      153568   10013310      NA     8958046  51678 ""               
 6      153569   10013310      NA    10682517  50933 ""               
 7      153570   10013310      NA    11713499  51133 ""               
 8      153571   10013310      NA    11713499  51146 ""               
 9      153572   10013310      NA    11713499  51200 ""               
10      153573   10013310      NA    11713499  51221 ""               
   charttime           storetime          
   <dttm>              <dttm>             
 1 2153-05-06 10:30:00 NA                 
 2 2153-05-06 10:30:00 2153-05-06 11:22:00
 3 2153-05-06 10:30:00 2153-05-06 11:22:00
 4 2153-05-06 10:30:00 2153-05-06 11:41:00
 5 2153-05-06 10:30:00 2153-05-06 11:22:00
 6 2153-05-06 10:30:00 NA                 
 7 2153-05-06 10:30:00 2153-05-06 11:09:00
 8 2153-05-06 10:30:00 2153-05-06 11:09:00
 9 2153-05-06 10:30:00 2153-05-06 11:09:00
10 2153-05-06 10:30:00 2153-05-06 11:09:00
   value                                    valuenum valueuom ref_range_lower
   <chr>                                       <dbl> <chr>              <dbl>
 1 HOLD.  DISCARD GREATER THAN 24 HRS OLD.     NA    ""                  NA  
 2 5                                            5    ""                  NA  
 3 2                                            2    ""                  NA  
 4 ___                                          2.97 "ng/mL"              0  
 5 14                                          14    ""                  NA  
 6 HOLD.  DISCARD GREATER THAN 4 HOURS OLD.    NA    ""                  NA  
 7 1.90                                         1.9  "K/uL"               1.2
 8 0.2                                          0.2  "%"                  0  
 9 0.1                                          0.1  "%"                  1  
10 32.5                                        32.5  "%"                 34  
   ref_range_upper flag       priority comments
             <dbl> <chr>      <chr>    <chr>   
 1           NA    ""         STAT     "___"   
 2           NA    ""         STAT     ""      
 3           NA    ""         STAT     ""      
 4            0.01 "abnormal" STAT     "___"   
 5           NA    ""         STAT     ""      
 6           NA    ""         STAT     "___"   
 7            3.7  ""         STAT     ""      
 8            1    ""         STAT     ""      
 9            7    "abnormal" STAT     ""      
10           45    "abnormal" STAT     ""      
# ℹ 2,275 more rows
d_icd_procedures <- read_csv("~/mimic/hosp/d_icd_procedures.csv.gz")
Rows: 85257 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sid_procedures <- read_csv("~/mimic/hosp/procedures_icd.csv.gz") |>
  filter(subject_id == sid) |>
  left_join(d_icd_procedures, by = "icd_code")|>
  mutate(chartdate = as.POSIXct(chartdate)) |>
  print(width = Inf)
Rows: 669186 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): icd_code
dbl  (4): subject_id, hadm_id, seq_num, icd_version
date (1): chartdate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 9 × 8
  subject_id  hadm_id seq_num chartdate           icd_code icd_version.x
       <dbl>    <dbl>   <dbl> <dttm>              <chr>            <dbl>
1   10013310 21243435       1 2153-05-27 00:00:00 4A023N7             10
2   10013310 21243435       2 2153-05-27 00:00:00 B2111ZZ             10
3   10013310 21243435       3 2153-05-27 00:00:00 B241ZZ3             10
4   10013310 22098926       1 2153-06-10 00:00:00 03CG3ZZ             10
5   10013310 22098926       2 2153-06-10 00:00:00 3E05317             10
6   10013310 22098926       3 2153-07-15 00:00:00 0DH63UZ             10
7   10013310 22098926       4 2153-06-11 00:00:00 3E0G76Z             10
8   10013310 27682188       1 2153-05-06 00:00:00 027034Z             10
9   10013310 27682188       2 2153-05-06 00:00:00 B211YZZ             10
  icd_version.y
          <dbl>
1            10
2            10
3            10
4            10
5            10
6            10
7            10
8            10
9            10
  long_title                                                                    
  <chr>                                                                         
1 Measurement of Cardiac Sampling and Pressure, Left Heart, Percutaneous Approa…
2 Fluoroscopy of Multiple Coronary Arteries using Low Osmolar Contrast          
3 Ultrasonography of Multiple Coronary Arteries, Intravascular                  
4 Extirpation of Matter from Intracranial Artery, Percutaneous Approach         
5 Introduction of Other Thrombolytic into Peripheral Artery, Percutaneous Appro…
6 Insertion of Feeding Device into Stomach, Percutaneous Approach               
7 Introduction of Nutritional Substance into Upper GI, Via Natural or Artificia…
8 Dilation of Coronary Artery, One Artery with Drug-eluting Intraluminal Device…
9 Fluoroscopy of Multiple Coronary Arteries using Other Contrast                
d_icd_diagnoses <- read_csv("~/mimic/hosp/d_icd_diagnoses.csv.gz")
Rows: 109775 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): icd_code, long_title
dbl (1): icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
sid_diagnoses_icd <- read_csv("~/mimic/hosp/diagnoses_icd.csv.gz") |>
  filter(subject_id == sid) |>
  left_join(d_icd_diagnoses, by = "icd_code") |>
  print(width = Inf)
Rows: 4756326 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): icd_code
dbl (4): subject_id, hadm_id, seq_num, icd_version

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 71 × 7
   subject_id  hadm_id seq_num icd_code icd_version.x icd_version.y
        <dbl>    <dbl>   <dbl> <chr>            <dbl>         <dbl>
 1   10013310 21243435       1 I222                10            10
 2   10013310 21243435       2 I5023               10            10
 3   10013310 21243435       3 I428                10            10
 4   10013310 21243435       4 E1142               10            10
 5   10013310 21243435       5 E1165               10            10
 6   10013310 21243435       6 I213                10            10
 7   10013310 21243435       7 I110                10            10
 8   10013310 21243435       8 I2510               10            10
 9   10013310 21243435       9 M25511              10            10
10   10013310 21243435      10 E785                10            10
   long_title                                                                   
   <chr>                                                                        
 1 Subsequent non-ST elevation (NSTEMI) myocardial infarction                   
 2 Acute on chronic systolic (congestive) heart failure                         
 3 Other cardiomyopathies                                                       
 4 Type 2 diabetes mellitus with diabetic polyneuropathy                        
 5 Type 2 diabetes mellitus with hyperglycemia                                  
 6 ST elevation (STEMI) myocardial infarction of unspecified site               
 7 Hypertensive heart disease with heart failure                                
 8 Atherosclerotic heart disease of native coronary artery without angina pecto…
 9 Pain in right shoulder                                                       
10 Hyperlipidemia, unspecified                                                  
# ℹ 61 more rows
sid_adt %>%
  filter(eventtype != "discharge") %>%
  ggplot() +
  geom_segment(aes(
    x = intime, 
    xend = outtime, 
    y = "ADT", 
    yend = "ADT", 
    color = careunit, 
    linewidth = str_detect(careunit, "(ICU|CCU)")
    )) +
  guides(linewidth = "none") +
  geom_point(data = sid_lab, 
             aes(x = charttime, y = "Lab", color = "Lab"),
             shape = 3, size = 4) + 
  geom_point(data = sid_procedures, 
             aes(x = chartdate, y = "Procedure", color = "Procedure", shape = long_title),
             size = 4) +
  scale_y_discrete(limits = c("Procedure", "Lab", "ADT")) +
  labs(
    x = "Calendar Time",
    y = "",
    title = str_c("Patient ", sid, " ", ",", sid_patients$gender, " ,", " ", sid_patients$anchor_age, " years old", ",", " ", sid_admissions$race[1]),
    subtitle = str_c(sid_diagnoses_icd$long_title[1], "\n", sid_diagnoses_icd$long_title[2], "\n",  sid_diagnoses_icd$long_title[3]),
    shape = "Procedure",
  )+
  theme(
    legend.position = "bottom", 
    legend.box = "vertical", 
    legend.box.just = "center", 
    legend.text = element_text(size = 2)
  )
Warning: Using linewidth for a discrete variable is not advised.
Warning: The shape palette can deal with a maximum of 6 discrete values because more
than 6 becomes difficult to discriminate
ℹ you have requested 9 values. Consider specifying shapes manually if you need
  that many have them.
Warning: Removed 3 rows containing missing values (`geom_point()`).

Q1.2 ICU stays

ICU stays are a subset of ADT history. We will visualize the vitals of a patient during ICU stays.

This figure shows the vitals of the patient 10001217 during ICU stays. The x-axis is the calendar time, and the y-axis is the value of the vital. The color of the line represents the type of vital. The facet grid shows the abbreviation of the vital and the stay ID.

Do a similar visualization for the patient 10013310. ####Answer

# eval: true
library(ggplot2)
library(dplyr)
library(readr)
library(tidyr)
library(arrow)
library(tidyverse)

sid <- 10013310

sid_icustays <- read_csv("~/mimic/icu/icustays.csv.gz") |>
  filter(subject_id == sid) |>
  print(width = Inf)
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 2 × 8
  subject_id  hadm_id  stay_id first_careunit                                 
       <dbl>    <dbl>    <dbl> <chr>                                          
1   10013310 22098926 32769810 Neuro Surgical Intensive Care Unit (Neuro SICU)
2   10013310 27682188 31203589 Coronary Care Unit (CCU)                       
  last_careunit            intime              outtime               los
  <chr>                    <dttm>              <dttm>              <dbl>
1 Neuro Intermediate       2153-06-10 11:55:42 2153-06-16 19:03:14  6.30
2 Coronary Care Unit (CCU) 2153-05-06 18:28:00 2153-05-07 20:47:19  1.10
sid_vitalsign <- arrow::read_parquet("~/desktop/203hw/hw3/chartevents.parquet") %>%
  filter(subject_id == sid, itemid %in% c(220045, 220179, 220180, 223761, 220210)) %>%
  group_by(subject_id, stay_id, charttime, itemid) %>%
  summarize(value = mean(as.numeric(value), na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(
    id_cols = c(subject_id, stay_id, charttime),
    names_from = itemid,
    values_from = value
  ) %>%
  rename(
    HR = `220045`,
    NBPs = `220179`,
    NBPd = `220180`,
    temperature_f = `223761`,
    RR = `220210`
  ) %>%
  ungroup() %>%
  mutate(
    charttime = as.POSIXct(charttime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"),
    HR = as.numeric(HR),
    NBPs = as.numeric(NBPs),
    NBPd = as.numeric(NBPd),
    temperature_f = as.numeric(temperature_f),
    RR = as.numeric(RR)
  ) 

sid_vitalsign_long <- sid_vitalsign %>%
  pivot_longer(
    cols = c("HR", "NBPs", "NBPd", "temperature_f", "RR"),
    names_to = "vital_sign",
    values_to = "value"
  ) %>%
  drop_na(value)


ggplot(data = sid_vitalsign_long, aes(x = charttime, y = value, color = vital_sign)) +
  geom_line() + 
  geom_point( ) +
  facet_grid(vars(vital_sign), vars(stay_id), scales = "free", as.table = TRUE) +
  labs(title = "Patient ICU stays - Vitals",
       x = "Time", 
       y = "Value") +
  guides(color = FALSE) +
  scale_color_discrete(name = NULL) +
  theme_minimal() +
  theme(
    strip.background = element_rect(fill = "grey", colour = "grey"), 
    strip.text = element_text(colour = "white", size = 10),
    panel.border = element_rect(linetype = "solid", colour = "grey", fill = NA),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    panel.grid.minor = element_blank()
     )
Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
of ggplot2 3.3.4.

Q2. ICU stays

icustays.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/icustays/) contains data about Intensive Care Units (ICU) stays. The first 10 lines are

zcat < ~/mimic/icu/icustays.csv.gz | head
subject_id,hadm_id,stay_id,first_careunit,last_careunit,intime,outtime,los
10000032,29079034,39553978,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2180-07-23 14:00:00,2180-07-23 23:50:47,0.4102662037037037
10000980,26913865,39765666,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2189-06-27 08:42:00,2189-06-27 20:38:27,0.4975347222222222
10001217,24597018,37067082,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-11-20 19:18:02,2157-11-21 22:08:00,1.1180324074074075
10001217,27703517,34592300,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-12-19 15:42:24,2157-12-20 14:27:41,0.9481134259259258
10001725,25563031,31205490,Medical/Surgical Intensive Care Unit (MICU/SICU),Medical/Surgical Intensive Care Unit (MICU/SICU),2110-04-11 15:52:22,2110-04-12 23:59:56,1.338587962962963
10001884,26184834,37510196,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-01-11 04:20:05,2131-01-20 08:27:30,9.171817129629629
10002013,23581541,39060235,Cardiac Vascular Intensive Care Unit (CVICU),Cardiac Vascular Intensive Care Unit (CVICU),2160-05-18 10:00:53,2160-05-19 17:33:33,1.3143518518518518
10002155,20345487,32358465,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-03-09 21:33:00,2131-03-10 18:09:21,0.8585763888888889
10002155,23822395,33685454,Coronary Care Unit (CCU),Coronary Care Unit (CCU),2129-08-04 12:45:00,2129-08-10 17:02:38,6.178912037037037

Q2.1 Ingestion

Import icustays.csv.gz as a tibble icustays_tble.
####Answer:

library(dplyr)
library(readr)
library(ggplot2)

icustays_tble <- read_csv("~/mimic/icu/icustays.csv.gz")
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Q2.2 Summary and visualization

How many unique subject_id? Can a subject_id have multiple ICU stays? Summarize the number of ICU stays per subject_id by graphs.
####Answer:

# eval: true
library(dplyr)
library(readr)
library(ggplot2)

number_unique_subjects <- icustays_tble %>%
  summarize(number_of_unique_subjects = n_distinct(subject_id))
number_unique_subjects
# A tibble: 1 × 1
  number_of_unique_subjects
                      <int>
1                     50920
icu_stays_per_subject <- icustays_tble %>%
  group_by(subject_id) %>%
  summarize(number_of_icu_stays = n()) %>%
  ungroup()

ggplot(icu_stays_per_subject, aes(x = number_of_icu_stays)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  labs(
    title = "Distribution of ICU Stays per Subject",
    x = "Number of ICU Stays",
    y = "Frequency"
  ) +
  theme_minimal()

ggplot(icu_stays_per_subject, aes(x = factor(number_of_icu_stays))) +
  geom_bar(binwidth = 1, fill = "blue", color = "black") +
  labs(
    title = "Number of ICU Stays per Subject",
    x = "Number of ICU Stays",
    y = "Count of Subjects"
  ) +
  theme_minimal()
Warning in geom_bar(binwidth = 1, fill = "blue", color = "black"): Ignoring
unknown parameters: `binwidth`

There are 50920 unique subject_id. A subject_id can have multiple ICU stays. The distribution of ICU stays per subject_id is left-skewed, with most subject_id having only one ICU stay.

Q3. admissions data

Information of the patients admitted into hospital is available in admissions.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/admissions/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/admissions.csv.gz | head
subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admit_provider_id,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag
10000032,22595853,2180-05-06 22:23:00,2180-05-07 17:15:00,,URGENT,P874LG,TRANSFER FROM HOSPITAL,HOME,Other,ENGLISH,WIDOWED,WHITE,2180-05-06 19:17:00,2180-05-06 23:30:00,0
10000032,22841357,2180-06-26 18:27:00,2180-06-27 18:49:00,,EW EMER.,P09Q6Y,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-06-26 15:54:00,2180-06-26 21:31:00,0
10000032,25742920,2180-08-05 23:44:00,2180-08-07 17:50:00,,EW EMER.,P60CC5,EMERGENCY ROOM,HOSPICE,Medicaid,ENGLISH,WIDOWED,WHITE,2180-08-05 20:58:00,2180-08-06 01:44:00,0
10000032,29079034,2180-07-23 12:35:00,2180-07-25 17:55:00,,EW EMER.,P30KEH,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-07-23 05:54:00,2180-07-23 14:00:00,0
10000068,25022803,2160-03-03 23:16:00,2160-03-04 06:26:00,,EU OBSERVATION,P51VDL,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2160-03-03 21:55:00,2160-03-04 06:26:00,0
10000084,23052089,2160-11-21 01:56:00,2160-11-25 14:52:00,,EW EMER.,P6957U,WALK-IN/SELF REFERRAL,HOME HEALTH CARE,Medicare,ENGLISH,MARRIED,WHITE,2160-11-20 20:36:00,2160-11-21 03:20:00,0
10000084,29888819,2160-12-28 05:11:00,2160-12-28 16:07:00,,EU OBSERVATION,P63AD6,PHYSICIAN REFERRAL,,Medicare,ENGLISH,MARRIED,WHITE,2160-12-27 18:32:00,2160-12-28 16:07:00,0
10000108,27250926,2163-09-27 23:17:00,2163-09-28 09:04:00,,EU OBSERVATION,P38XXV,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2163-09-27 16:18:00,2163-09-28 09:04:00,0
10000117,22927623,2181-11-15 02:05:00,2181-11-15 14:52:00,,EU OBSERVATION,P2358X,EMERGENCY ROOM,,Other,ENGLISH,DIVORCED,WHITE,2181-11-14 21:51:00,2181-11-15 09:57:00,0

Q3.1 Ingestion

Import admissions.csv.gz as a tibble admissions_tble. ####Answer:

library(dplyr)
library(readr)
library(ggplot2)

admissions_tble <- read_csv("~/mimic/hosp/admissions.csv.gz")
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Q3.2 Summary and visualization

Summarize the following information by graphics and explain any patterns you see.

  • number of admissions per patient
  • admission hour (anything unusual?)
  • admission minute (anything unusual?)
  • length of hospital stay (from admission to discharge) (anything unusual?)

According to the MIMIC-IV documentation,

All dates in the database have been shifted to protect patient confidentiality. Dates will be internally consistent for the same patient, but randomly distributed in the future. Dates of birth which occur in the present time are not true dates of birth. Furthermore, dates of birth which occur before the year 1900 occur if the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300.

####Answer:

# eval: true
library(dplyr)
library(readr)
library(ggplot2)

admissions_per_patient <- admissions_tble %>%
  group_by(subject_id) %>%
  summarize(number_of_admissions = n()) %>%
  ungroup()

admissions_tble <- admissions_tble %>%
  mutate(admittime = as.POSIXct(admittime, format = "%Y-%m-%d %H:%M:%S"),
         dischtime = as.POSIXct(dischtime, format = "%Y-%m-%d %H:%M:%S"))

#number of admissions per patient
ggplot(admissions_per_patient, aes(x = number_of_admissions)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  labs(
    title = "Distribution of Admissions per Patient",
    x = "Number of Admissions",
    y = "Frequency"
  ) +
  theme_minimal()

#admission hour
admissions_tble <- admissions_tble %>%
  mutate(admittime_hour = hour(admittime))

admission_hour <- admissions_tble %>%
  summarize(min_admission_hour = min(admittime_hour), max_admission_hour = max(admittime_hour))

ggplot(admissions_tble, aes(x = admittime_hour)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  labs(title = "Distribution of Admission Hours", x = "Hour of Admission", y = "Count")

#admission minute
admissions_tble <- admissions_tble %>%
  mutate(admittime_minute = minute(admittime))

ggplot(admissions_tble, aes(x = admittime_minute)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  labs(
    title = "Distribution of Admission Minute",
    x = "Admission Minute",
    y = "Frequency"
  ) +
  theme_minimal()

#length of hospital stay
admissions_tble <- admissions_tble %>%
  mutate(
    length_of_stay = as.numeric(dischtime - admittime, units = "days")
  )

ggplot(admissions_tble, aes(x = length_of_stay)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  labs(
    title = "Distribution of Length of Stay",
    x = "Length of Stay",
    y = "Frequency"
  ) +
  theme_minimal()

The distribution of admissions per patient is left-skewed, with most patients having only one admission. The distribution of admission hours is uniform, with no unusual patterns. The distribution of admission minutes is distributed with unusual patterns. The distribution of length of hospital stay is left-skewed, with most patients having a length of stay less than 30 days.

Q4. patients data

Patient information is available in patients.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/patients/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/patients.csv.gz | head
subject_id,gender,anchor_age,anchor_year,anchor_year_group,dod
10000032,F,52,2180,2014 - 2016,2180-09-09
10000048,F,23,2126,2008 - 2010,
10000068,F,19,2160,2008 - 2010,
10000084,M,72,2160,2017 - 2019,2161-02-13
10000102,F,27,2136,2008 - 2010,
10000108,M,25,2163,2014 - 2016,
10000115,M,24,2154,2017 - 2019,
10000117,F,48,2174,2008 - 2010,
10000178,F,59,2157,2017 - 2019,

Q4.1 Ingestion

Import patients.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/patients/) as a tibble patients_tble.
####Answer:

library(dplyr)
library(readr)
library(ggplot2)

patients_tble <- read_csv("~/mimic/hosp/patients.csv.gz")
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Q4.2 Summary and visualization

Summarize variables gender and anchor_age by graphics, and explain any patterns you see.

##Answer:

# eval: true
library(dplyr)
library(readr)
library(ggplot2)

#gender
gender_distribution <- patients_tble %>%
  count(gender) %>%
  mutate(percentage = n / sum(n) * 100)

ggplot(gender_distribution, aes(x = gender, y = percentage, fill = gender)) +
  geom_bar(stat = "identity") +
  labs(title = "Gender Distribution", x = "Gender", y = "Percentage (%)") +
  theme_minimal()

#anchor_age
age_distribution <- patients_tble %>%
  count(anchor_age)

ggplot(age_distribution, aes(x = anchor_age, y = n)) +
  geom_bar(stat = "identity", fill = "blue", color = "black") +
  labs(
    title = "Distribution of Anchor Age",
    x = "Anchor Age",
    y = "Number of Patients"
  ) +
  theme_minimal()

female patients are much more than male patients. The distribution of anchor age is left-skewed, with most patients being less than 20 years old and older than 60 years old.

Q5. Lab results

labevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/labevents/) contains all laboratory measurements for patients. The first 10 lines are

zcat < ~/mimic/hosp/labevents.csv.gz | head

d_labitems.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/d_labitems/) is the dictionary of lab measurements.

zcat < ~/mimic/hosp/d_labitems.csv.gz | head

We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of labevents.csv.gz that only containing these items for the patients in icustays_tble. Further restrict to the last available measurement (by storetime) before the ICU stay. The final labevents_tble should have one row per ICU stay and columns for each lab measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make labevents_pq folder available at the current working directory hw3, for example, by a symbolic link.

####Answer:

# eval: true
library(dplyr)
library(arrow)
library(readr)
library(tidyr)

labevents <- arrow::read_parquet("/Users/luxier/Desktop/203hw/hw3/labevents_pq/part-0.parquet")
d_labitems <- read_csv("~/mimic/hosp/d_labitems.csv.gz")
Rows: 1622 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): label, fluid, category
dbl (1): itemid

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
lab_item_ids <- c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)

labevents_filtered <- filter(labevents, itemid %in% lab_item_ids)
labevents_before_icu <- labevents_filtered %>%
  inner_join(icustays_tble, by = "subject_id") %>%
  filter(storetime < intime)
Warning in inner_join(., icustays_tble, by = "subject_id"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1905 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
labevents_last_measurement <- labevents_before_icu %>%
  group_by(subject_id, stay_id, itemid) %>%
  slice_max(order_by = storetime, n = 1, with_ties = FALSE) %>%
  ungroup()

labevents_tble <- labevents_last_measurement %>%
  pivot_wider(
    id_cols = c(subject_id, stay_id),
    names_from = itemid,
    values_from = valuenum,  
    values_fill = list(valuenum = NA)  
  ) %>%
  rename(
    creatinine = `50912`,
    potassium = `50971`,
    sodium = `50983`,
    chloride = `50902`,
    bicarbonate = `50882`,
    hematocrit = `51221`,
    wbc = `51301`,
    glucose = `50931`
  ) %>%
  collect()

labevents_tble
# A tibble: 68,467 × 10
   subject_id  stay_id bicarbonate chloride creatinine glucose potassium sodium
        <dbl>    <dbl>       <dbl>    <dbl>      <dbl>   <dbl>     <dbl>  <dbl>
 1   10000032 39553978          25       95        0.7     102       6.7    126
 2   10000980 39765666          21      109        2.3      89       3.9    144
 3   10001217 34592300          30      104        0.5      87       4.1    142
 4   10001217 37067082          22      108        0.6     112       4.2    142
 5   10001725 31205490          NA       98       NA        NA       4.1    139
 6   10001884 37510196          30       88        1.1     141       4.5    130
 7   10002013 39060235          24      102        0.9     288       3.5    137
 8   10002155 31090461          23       98        2.8     117       4.9    135
 9   10002155 32358465          26       85        1.4     133       5.7    120
10   10002155 33685454          24      105        1.1     138       4.6    139
# ℹ 68,457 more rows
# ℹ 2 more variables: hematocrit <dbl>, wbc <dbl>

Q6. Vitals from charted events

chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head

d_items.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.

# eval: false
zcat < ~/mimic/icu/d_items.csv.gz | head
itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,
220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,
220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140
220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90
220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,

We are interested in the vitals for ICU patients: heart rate (220045), systolic non-invasive blood pressure (220179), diastolic non-invasive blood pressure (220180), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble. Further restrict to the first vital measurement within the ICU stay. The final chartevents_tble should have one row per ICU stay and columns for each vital measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make chartevents_pq folder available at the current working directory, for example, by a symbolic link. ####Amswer:

# eval: true
library(dplyr)
library(arrow)
library(readr)
library(tidyr)

chartevents <- arrow::read_parquet("~/desktop/203hw/hw3/chartevents.parquet")
d_items <- read_csv("~/mimic/icu/d_items.csv.gz")
Rows: 4014 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): label, abbreviation, linksto, category, unitname, param_type
dbl (3): itemid, lownormalvalue, highnormalvalue

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
icustays_tble <- read_csv("~/mimic/icu/icustays.csv.gz")
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
items_filter <- filter(d_items, itemid %in% c(220045, 220179, 220180, 223761, 220210))
chartevents_filtered <- filter(chartevents, itemid %in% items_filter$itemid)

chartevents_within_icu <- chartevents_filtered %>%
  inner_join(icustays_tble, by = c("subject_id", "hadm_id")) %>%
  filter(charttime >= intime & charttime <= outtime)
Warning in inner_join(., icustays_tble, by = c("subject_id", "hadm_id")): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 4524 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
first_vitals_within_icu <- chartevents_within_icu %>%
  select(-stay_id.y) %>%  
  rename(stay_id = stay_id.x) %>%  
  group_by(stay_id, itemid) %>%
  slice_min(order_by = charttime, n = 1) %>%
  ungroup()

chartevents_tble <- first_vitals_within_icu %>%
  pivot_wider(
    id_cols = c(subject_id, stay_id),
    names_from = itemid,
    values_from = value,
    values_fill = list(valuenum = NA) 
  ) %>%
  rename(
    heart_rate = `220045`,
    non_invasive_blood_pressure_systolic = `220179`,
    non_invasive_blood_pressure_diatolic = `220180`,
    temperature_f = `223761`,
    respiratory_rate = `220210`
  ) %>%
  collect()
chartevents_tble
# A tibble: 73,164 × 7
   subject_id  stay_id heart_rate non_invasive_blood_pr…¹ non_invasive_blood_p…²
        <dbl>    <dbl> <chr>      <chr>                   <chr>                 
 1   12466550 30000153 104        113                     77                    
 2   13180007 30000213 74         168                     65                    
 3   18421337 30000484 106        101                     75                    
 4   12207593 30000646 100        107                     66                    
 5   12980335 30001148 80         102                     48                    
 6   12168737 30001336 65         110                     52                    
 7   17371178 30001396 86         169                     103                   
 8   16513856 30001446 82         75                      56                    
 9   17461994 30001471 91         154                     75                    
10   10656173 30001555 77         129                     72                    
# ℹ 73,154 more rows
# ℹ abbreviated names: ¹​non_invasive_blood_pressure_systolic,
#   ²​non_invasive_blood_pressure_diatolic
# ℹ 2 more variables: respiratory_rate <chr>, temperature_f <chr>

Q7. Putting things together

Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are all ICU stays of adults (age at intime >= 18) and columns contain at least following variables

  • all variables in icustays_tble
  • all variables in admissions_tble
  • all variables in patients_tble
  • the last lab measurements before the ICU stay in labevents_tble
  • the first vital measurements during the ICU stay in chartevents_tble

The final mimic_icu_cohort should have one row per ICU stay and columns for each variable.

####Answer:

# eval: true
library(dplyr)
library(lubridate)

patients_tble <- read_csv("~/mimic/hosp/patients.csv.gz")
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
admissions_tble <- read_csv("~/mimic/hosp/admissions.csv.gz")
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
icustays_tble <- read_csv("~/mimic/icu/icustays.csv.gz")
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#intime>=18
icustays_tble <- icustays_tble %>%
  left_join(patients_tble, by = "subject_id") %>%  
  mutate(
    intime_year = year(intime),  
    age_at_intime = anchor_age + (intime_year - anchor_year)  
  ) %>%
  filter(age_at_intime >= 18)

#combine all the tables
mimic_icu_cohort <- icustays_tble %>%
  left_join(admissions_tble, by = c("subject_id", "hadm_id")) %>%
  left_join(patients_tble, by = "subject_id") %>%
  left_join(labevents_tble, by = c("subject_id", "stay_id")) %>%
  left_join(chartevents_tble, by = c("subject_id", "stay_id")) %>%
  rename(
    anchor_age = anchor_age.x,
    anchor_year = anchor_year.x,
    anchor_year_group = anchor_year_group.x,
    gender = gender.x,
    dod = dod.x
  ) %>%
  select(-c(anchor_age.y, anchor_year.y, anchor_year_group.y, dod.y, gender.y)) %>%
  collect()

mimic_icu_cohort
# A tibble: 73,181 × 42
   subject_id  hadm_id  stay_id first_careunit last_careunit intime             
        <dbl>    <dbl>    <dbl> <chr>          <chr>         <dttm>             
 1   10000032 29079034 39553978 Medical Inten… Medical Inte… 2180-07-23 14:00:00
 2   10000980 26913865 39765666 Medical Inten… Medical Inte… 2189-06-27 08:42:00
 3   10001217 24597018 37067082 Surgical Inte… Surgical Int… 2157-11-20 19:18:02
 4   10001217 27703517 34592300 Surgical Inte… Surgical Int… 2157-12-19 15:42:24
 5   10001725 25563031 31205490 Medical/Surgi… Medical/Surg… 2110-04-11 15:52:22
 6   10001884 26184834 37510196 Medical Inten… Medical Inte… 2131-01-11 04:20:05
 7   10002013 23581541 39060235 Cardiac Vascu… Cardiac Vasc… 2160-05-18 10:00:53
 8   10002155 20345487 32358465 Medical Inten… Medical Inte… 2131-03-09 21:33:00
 9   10002155 23822395 33685454 Coronary Care… Coronary Car… 2129-08-04 12:45:00
10   10002155 28994087 31090461 Medical/Surgi… Medical/Surg… 2130-09-24 00:50:00
# ℹ 73,171 more rows
# ℹ 36 more variables: outtime <dttm>, los <dbl>, gender <chr>,
#   anchor_age <dbl>, anchor_year <dbl>, anchor_year_group <chr>, dod <date>,
#   intime_year <dbl>, age_at_intime <dbl>, admittime <dttm>, dischtime <dttm>,
#   deathtime <dttm>, admission_type <chr>, admit_provider_id <chr>,
#   admission_location <chr>, discharge_location <chr>, insurance <chr>,
#   language <chr>, marital_status <chr>, race <chr>, edregtime <dttm>, …

Q8. Exploratory data analysis (EDA)

Summarize the following information about the ICU stay cohort mimic_icu_cohort using appropriate numerics or graphs:

  • Length of ICU stay los vs demographic variables (race, insurance, marital_status, gender, age at intime)

  • Length of ICU stay los vs the last available lab measurements before ICU stay

  • Length of ICU stay los vs the average vital measurements within the first hour of ICU stay

  • Length of ICU stay los vs first ICU unit

####Answer: - Length of ICU stay los vs demographic variables (race, insurance, marital_status, gender, age at intime)

# eval: true
library(dplyr)
library(ggplot2)
library(tidyr)

#Length of ICU stay `los` vs demographic
ggplot(mimic_icu_cohort, aes(x = race, y = los)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Length of ICU Stay by Race", x = "Race", y = "Length of Stay (days)")

ggplot(mimic_icu_cohort, aes(x = insurance, y = los)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Length of ICU Stay by Insurance", x = "Insurance", y = "Length of Stay (days)")

ggplot(mimic_icu_cohort, aes(x = marital_status, y = los)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Length of ICU Stay by Marital Status", x = "Marital Status", y = "Length of Stay (days)")

ggplot(mimic_icu_cohort, aes(x = gender, y = los)) +
  geom_boxplot() +
  labs(title = "Length of ICU Stay by Gender", x = "Gender", y = "Length of Stay (days)")

ggplot(mimic_icu_cohort, aes(x = age_at_intime, y = los)) +
  geom_point() +
  labs(title = "Length of ICU Stay by Age at Intime", x = "Age at Intime", y = "Length of Stay (days)")

white will stay longer in ICU. other insurance will stay longer in ICU. married will stay longer in ICU. the people who are older will stay longer in ICU.

  • Length of ICU stay los vs the last available lab measurements before ICU stay
# eval: true
library(dplyr)
library(ggplot2)
library(tidyr)

lab_vars <- c("creatinine", "potassium", "sodium", "chloride", "bicarbonate", "hematocrit", "wbc", "glucose")

mimic_icu_cohort_long <- mimic_icu_cohort %>%
  gather(key = "lab_measure", value = "lab_value", one_of(lab_vars))

ggplot(mimic_icu_cohort_long, aes(x = los, y = lab_value)) +
  geom_point() +
  facet_wrap(~ lab_measure, scales = "free_y") +
  labs(title = "Length of ICU Stay by Last Available Lab Measurements", 
       x = "Length of Stay (days)", 
       y = "Lab Measurement Value") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Warning: Removed 60686 rows containing missing values (`geom_point()`).

most of the last available lab measurements before ICU stay are associated with the length of ICU stay. creatinine, glucose and wbc are associated with the length of ICU stay.

  • Length of ICU stay los vs the average vital measurements within the first hour of ICU stay.
# eval: true
library(dplyr)
library(ggplot2)
library(tidyr)


mimic_icu_cohort0 <- mimic_icu_cohort %>%
  mutate(
    heart_rate = as.numeric(heart_rate),
    non_invasive_blood_pressure_systolic = as.numeric(non_invasive_blood_pressure_systolic),
    non_invasive_blood_pressure_diatolic = as.numeric(non_invasive_blood_pressure_diatolic),
    temperature_f = as.numeric(temperature_f),
    respiratory_rate = as.numeric(respiratory_rate)
  )

mimic_icu_cohort1 <- mimic_icu_cohort0 %>%
  group_by(subject_id, stay_id) %>%
  summarise(
    avg_heart_rate = mean(heart_rate, na.rm = TRUE),
    avg_sys_bp = mean(non_invasive_blood_pressure_systolic, na.rm = TRUE),
    avg_dia_bp = mean(non_invasive_blood_pressure_diatolic, na.rm = TRUE),
    avg_temperature_f = mean(temperature_f, na.rm = TRUE),  
    avg_respiratory_rate = mean(respiratory_rate, na.rm = TRUE)
  )
`summarise()` has grouped output by 'subject_id'. You can override using the
`.groups` argument.
mimic_icu_cohort1 <- mimic_icu_cohort0 %>%
  group_by(subject_id, stay_id) %>%
  summarise(
    avg_heart_rate = mean(heart_rate, na.rm = TRUE),
    avg_sys_bp = mean(non_invasive_blood_pressure_systolic, na.rm = TRUE),
    avg_dia_bp = mean(non_invasive_blood_pressure_diatolic, na.rm = TRUE),
    avg_temperature_f = mean(temperature_f, na.rm = TRUE),  
    avg_respiratory_rate = mean(respiratory_rate, na.rm = TRUE),
    los = los,
    .groups = 'drop'
  )

ggplot(mimic_icu_cohort1, aes(x = avg_heart_rate, y = los)) +
  geom_point() +
  labs(title = "Length of ICU Stay by Average Heart Rate", x = "Average Heart Rate", y = "Length of Stay (days)")
Warning: Removed 18 rows containing missing values (`geom_point()`).

ggplot(mimic_icu_cohort1, aes(x = avg_sys_bp, y = los)) +
  geom_point() +
  labs(title = "Length of ICU Stay by Average Systolic Blood Pressure", x = "Average Systolic Blood Pressure", y = "Length of Stay (days)")
Warning: Removed 979 rows containing missing values (`geom_point()`).

ggplot(mimic_icu_cohort1, aes(x = avg_dia_bp, y = los)) +
  geom_point() +
  labs(title = "Length of ICU Stay by Average Diastolic Blood Pressure", x = "Average Diastolic Blood Pressure", y = "Length of Stay (days)")
Warning: Removed 983 rows containing missing values (`geom_point()`).

ggplot(mimic_icu_cohort1, aes(x = avg_temperature_f, y = los)) +
  geom_point() +
  labs(title = "Length of ICU Stay by Average Temperature", x = "Average Temperature", y = "Length of Stay (days)")
Warning: Removed 1362 rows containing missing values (`geom_point()`).

most of the average vital measurements within the first hour of ICU stay are not associated with the length of ICU stay.

  • Length of ICU stay los vs first ICU unit
# eval:true
ggplot(mimic_icu_cohort, aes(x = first_careunit, y = los)) +
  geom_boxplot() +
  labs(title = "Length of ICU Stay by First ICU Unit",
       x = "First ICU Unit",
       y = "Length of Stay (days)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Neuro Stepdown ICU has the shortest length of stay, while the Surgical intensive care unit has the shortest length of stay.